Locating previously unknown patterns in data-mining results: a dual data- and knowledge-mining method

نویسندگان

  • Mir S. Siadaty
  • William A. Knaus
چکیده

BACKGROUND Data mining can be utilized to automate analysis of substantial amounts of data produced in many organizations. However, data mining produces large numbers of rules and patterns, many of which are not useful. Existing methods for pruning uninteresting patterns have only begun to automate the knowledge acquisition step (which is required for subjective measures of interestingness), hence leaving a serious bottleneck. In this paper we propose a method for automatically acquiring knowledge to shorten the pattern list by locating the novel and interesting ones. METHODS The dual-mining method is based on automatically comparing the strength of patterns mined from a database with the strength of equivalent patterns mined from a relevant knowledgebase. When these two estimates of pattern strength do not match, a high "surprise score" is assigned to the pattern, identifying the pattern as potentially interesting. The surprise score captures the degree of novelty or interestingness of the mined pattern. In addition, we show how to compute p values for each surprise score, thus filtering out noise and attaching statistical significance. RESULTS We have implemented the dual-mining method using scripts written in Perl and R. We applied the method to a large patient database and a biomedical literature citation knowledgebase. The system estimated association scores for 50,000 patterns, composed of disease entities and lab results, by querying the database and the knowledgebase. It then computed the surprise scores by comparing the pairs of association scores. Finally, the system estimated statistical significance of the scores. CONCLUSION The dual-mining method eliminates more than 90% of patterns with strong associations, thus identifying them as uninteresting. We found that the pruning of patterns using the surprise score matched the biomedical evidence in the 100 cases that were examined by hand. The method automates the acquisition of knowledge, thus reducing dependence on the knowledge elicited from human expert, which is usually a rate-limiting step.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

ارائه مدلی برای استخراج اطلاعات از مستندات متنی، مبتنی بر متن‌کاوی در حوزه یادگیری الکترونیکی

As computer networks become the backbones of science and economy, enormous quantities documents become available. So, for extracting useful information from textual data, text mining techniques have been used. Text Mining has become an important research area that discoveries unknown information, facts or new hypotheses by automatically extracting information from different written documents. T...

متن کامل

Data Mining System, Functionalities and Applications: A Radical Review

Data Mining is the process of locating potentially practical, interesting and previously unknown patterns from a big volume of data. It plays an important role in result orientation. Data mining can be used in each and every aspect of life. The same is similarly significant in other areas including sales/ marketing, revenue services, sports, health care and insurance etc. The said paper implies...

متن کامل

خوشه‌بندی اسناد مبتنی بر آنتولوژی و رویکرد فازی

Data mining, also known as knowledge discovery in database, is the process to discover unknown knowledge from a large amount of data. Text mining is to apply data mining techniques to extract knowledge from unstructured text. Text clustering is one of important techniques of text mining, which is the unsupervised classification of similar documents into different groups. The most important step...

متن کامل

Data Mining Performance in Identifying the Risk Factors of Early Arteriovenous Fistula Failure in Hemodialysis Patients

Background and Objectives: Arteriovenous fistula is a popular vascular access method for surgical treatment of hemodialysis patients. The method, however, is associated with a high rate of early failure varying in the range of 20-60%. Predicting early Arteriovenous fistula failure and its risk factors can help reduce its incidence, its hospitalization rate, and associated costs. In this study, ...

متن کامل

Data Mining: A Novel Outlook to Explore Knowledge in Health and Medical Sciences

Today medical and Healthcare industry generate loads of diverse data about patients, disease diagnosis, prognosis, management, hospitals’ resources, electronic patient health records, medical devices and etc. Using the most efficient processing and analyzing method for knowledge extraction is a key point to cost-saving in clinical decision making. Data mining, sometimes called data or knowledge...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • BMC Medical Informatics and Decision Making

دوره 6  شماره 

صفحات  -

تاریخ انتشار 2006